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Abstract 

A method is presented to construct goodness-of-fit statistics in many dimensions 
for which the distribution of all possible test results in the limit of an infinite number 
of data becomes Gaussian if also the number of dimensions becomes infinite. Fur- 
thermore, an explicit example is presented, for which this distribution as good as only 
depends on the expectation value and the variance of the statistic for any dimension 
larger than one. 



1 Introduction 

Goodness-of-fit (GOF) tests are designed to test the hypothesis that a sample of data is distributed 
following a given probability density function (PDF). The sample could, for example, consist of 
results of a repeated experiment, and the PDF could represent the theoretical prediction for the 
distribution of these results. The test consists of the evaluation of a function of the data, the GOF 
statistic, and the qualification of this result using the probability distribution of all possible results 
when the hypothesis is true, the test-distribution (TD). Despite the consensus that GOF tests are 
crucial for the validation of models in the scientific process, their success is mainly restricted 
to one-dimensional cases, that is, to situations in which the data-points have only one degree of 
freedom. The quest for GOF tests useful in situations where the number dim of dimensions is 
larger than one still continues [1, 2, 3]. 

In the following, we will see that the difficulty with GOF tests in many dimensions is to keep 
them distribution- free, that is, to construct them such that the TD is independent of the PDF. 1 We 
will, however, also see how GOF tests can be constructed such that the asymptotic TD, in the 
limit of an infinite sample size, has a Gaussian limit for dim — » oo for any PDF, so that it only 



'That is, for binning free tests, which we are considering. 
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depends on the expectation value and the variance of the GOF statistic this limit. Finally, we will 
encounter an explicit example for which the asymptotic TD depends, for any PDF, as good as 
only on the expectation value and the variance of the statistic for any dim > 1 . 



2 The structure of goodness-of-fit tests 

A GOF statistic is a function T N of the data sample {oh}^ constructed such that, under the 
hypothesis that the data are distributed in a space D. following the theoretical PDF P, there is a 
number such that 

lim T n ({oh} x n =1 ) = ^ • 

N— >oo 

The initial, naive, trust in its usefulness stems from the idea that, for a sample of finite size, the 
value of T N should be close to if the data are distributed following P, and that the value of T N 
is probably not so close to if the data are not distributed following P. This idea immediately 
leads to the question what is "close", which can be answered by the test-distribution (TD) 

r N 
7> N (t)= Mt-T N ({oH} x N =1 )) Yl^MduOi , (1) 

J i=1 

where each integration variable cu x runs over the whole space CI, and 6 denotes the Dirac distri- 
bution. "P N gives the probability distribution of the value of T N under the hypothesis that the data 
are indeed distributed following P. If it is very low at the value of T N for the empirical data, then 
the hypothesis that these data are distributed following P has to be rejected; it would under that 
hypothesis be very improbable to get such a value. In fact, knowledge of the value of the number 

is not necessary. One only needs to know where the bulk of the TD is. 

The evaluation of Tn ({oh}^ ) and the qualification of this result with "Pn constitute a GOF 
test. Notice that the TD is also necessary to qualify T N itself: it should consist of a peak around 
the expectation value 2 

N 

E(T N ) = ' 

J i=1 

If, for example, "P N is almost flat, then the test is useless since any data sample will lead to a 
value of T N that is equally probable and the test is not capable of distincting them. 



T N (M X N =1 ) IIPMdaH 



2.1 Difficulty in many dimensions 



The difficulty with the construction of GOF tests for dim > 1 is that it is in general very hard 
to calculate Pn. There is a way to avoid this, by using the the distribution from the case that 
P is constant. One then needs a mapping cp of the data-points such that the determinant of the 
Jacobian matrix of this mapping is equal to P: 

3X k ((p(a))) 



det 



3X L (cu] 



P(o>) 



2 If E(Tn ) 7^ too then the statistic is biased. 
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where XiJtu) is the k-th coordinate of data-point cu. Under the hypothesis that the original data 
are distributed following P, the mapped data are distributed following the uniform distribution. 
For dim = 1 , this mapping is simply given by the integrated PDF, or probability distribution 
function 



cp(cu) 



Pfco'l da/ 



since 



N 



5(t-T N ({(p(a)0} x N =1 )) JJPlcuOdcut = SU-TnIM^)) J J dcpt 



N 



i=1 



V 



uniform i 
N I 



where each integration variable cp i runs from to 1 . p™ lform is, for popular tests, known in the 



limit N — > oo. This asymptotic distribution -p™ lform is assumed not to be too different from 
puniform -p ests f or w hich this method can be applied are called distribution-free. 



2.2 Crude solution 

For dim > 1 , finding the mapping mentioned before is in general even more difficult than finding 
Vn- At least an estimate of Pn can be found using a straightforward Monte Carlo technique: one 
just has to generate 'theoretical data samples' the data-points of which are distributed following 
P and make a histogram of the values of Tn with these samples. Depending on how accessible 
the analytic structure of P is, several techniques exist for generating the theoretical samples. In 
the worst case that P is just given as a 'black box', the Metropolis-Hastings method can be used, 
possibly with its efficiency improved by techniques as suggested in [4]. Notice that one does 
not need extremely many samples, since one is, for this purpose, interested in the bulk of the 
distribution, not in the tails. 

Even with modern computer power, however, this Monte Carlo method can become very time 
consuming, especially for large N and large dim. In the next section, we will see how practical 
GOF statistics for dim > 1 can be constructed for which the asymptotic TD can be obtained in a 
more efficient way. 



3 Construction of goodness-of-fit statistics in many dimen- 
sions 

Several GOF statistics for the uniform distribution in many dimensions exist. They are called 
discrepancies [5] and intensively studied in the field of Quasi Monte Carlo integration [6, 7], 
for which one uses low-discrepancy sequences of multi-dimensional integration-points. These 
sequences give a faster convergence than expected from the common theory of Monte Carlo 
integration, because they are distributed 'more uniformly' than uniformly distributed random 
sequences; they give a GOF that is 'unacceptably good'. When dim = 1, discrepancies can be 
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used directly as GOF tests for general PDFs using the 'mapping method' mentioned before, and 
indeed, the Kolmogorov-Smirnov statistic is equivalent to the * -discrepancy, and the Cramer-von 
Mises statistic is equivalent to the L* 2 -discrepancy. 

In the following, we will have a look at the structure of discrepancies, and we will see how 
they can be deformed into GOF statistics for general PDFs. 



3.1 The structure of discrepancies 



Discrepancies anticipate the fact that, if a sequence {cuj^ is uniformly distributed in a space 
D. and N becomes large, then the average of a integrable function over the sequence should 
converge to the integral over D. of the function: 



(f) N -> (f) for N — > oo , 



where 



N 



(f}N— X f N and (f) = 



f (cu)dcu 



a 



Thus a class of functions H and a measure \x on H are chosen, and the discrepancy is defined as 



D 



N 



n 



|(f) N -(f)P>(df) 



1/T 



(2) 



So it is the integration error measured in a class of functions. For example, 7i could consist of 
indicator functions of a family S of subsets of O with \x such that for r — > oo 



D N = sup 



wies 



dcu 



In this case, the discrepancy is the maximum error made, if the volume of each subset is estimated 
using {cUij-Jl-,. Especially interesting are the quadratic discrepancies [11], for which r = 2, so 
that they are completely determined by the two-point function of (x: 



/ 1 N x 1/2 

Dn = ( NlX 1 ^'"^ ) 

i,j=1 



with 



Milm, oj 2 ) = c(u>i,o> 2 ) 



[c(tui, cu) + c(cu 2 , cu)] dcu + 



C1J 



c(tu,r|) dcudrj, 



where 



c(cui, cu 2 ) 



n 



So the discrepancy is the sum of the correlations of all pairs of data-points, measured with cor- 
relation function b. If the measure ll itself is completely determined by its two-point function, it 
is called Gaussian. 
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3.2 From discrepancies to GOF statistics 

Discrepancies are usually constructed in order to test the uniformity of sequences in a dim- 
dimensional hyper-cube [0, 1 ) dim . We are interested in more general cases, in which we want 
to test whether a sample {cuj^ of data-points is distributed in a space Q. following a given 
PDF P. We will assume that there exists an invertible mapping cp which maps these data-points 
onto points cp(au) G [0, 1 ) dim and for which the determinant J of the Jacobian matrix is known. 
The hypothesis dictates that the mapped points are distributed in the hyper-cube following (P o 
<P -1 )/(J ° ( P -1 )- We will denote this PDF by P itself from now on, the sample of mapped data- 
points by {oh}^ and the hyper-cube by D.. 

We want to use the idea, introduced before, to analise a data sample {oh}^ by looking at 
(f)N for different functions f. We will just have to keep in mind that, if {unl^l-i is distributed 
following P, then 

(f) N -> (fP) forN -> oo , 

where 'fP' denotes point-wise multiplication. This and the definition of the discrepancies lead 
us to define the statistic 



T N = 



|(fQ) N -(fQP)IV(df) 



H 



1/r 



where we inserted the function Q for flexibility. It could be absorbed in the definition of \±, but 
we prefer this formulation, in which we can stick to known examples for \i. We will see later on 
that the ideal choice for Q is 

Q = l/v / P. (3) 

We want to focus on the quadratic discrepancies for which \jl is Gaussian from now on. Like in 
[1 1], we shall define the statistic itself as an average case complexity, and not as a square-root of 
an average: 



N 



n 

N 



|(fQ) N -(fQP)| 2 ^(df) 



(4) 



«=1 



= tt Y_ Qc(oh, ojj) -2 YQc{tv h w) jP(co)da) 



N 



— N 



a. 



Qc(tu,ri) P(cu)P(ri)dcudri , 



£1 



where 



Qc(a>i,a> 2 ) = Q(a)i)Q(co 2 )c(a)i,a)2) 



(5) 



(6) 



The reason for the extra factor N becomes clear when we calculate the expectation value of T N . 
Assuming that the data-points are distributed independently following P, it is given by 



E(T N ) = 



Qc{w, cu) V{w)d<x> 



o 



ClJ 
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So it is independent of N and the statistic is not biased. In order to write down the variance, we 
shorten the notation such that the expectation value can be written as 



E(T N ) = (Qc 1)1 P 1 )-(Qc 1 , 2 P 1 P 2 ) , (7) 

and the variance is given by 

V(T N ) = (l-£)( ( (Qci,2Qci, 2 + Qc 1)1 Qc 1 , 2 )P 1 P 2 ) 



N 



- ((Qc 1 , 1 Qc2,3 + 4Qc 1)2 Qc 2)3 )P 1 P2P3; 

+ 3(Qc 1 , 2 QC3,4PlP 2 P3P4) j • (8) 



Notice that the formulation with Gaussian measures on the function class corresponds to 
a natural interpretation of the average of a square: given a sequence (u^)^ of functions, a 
sequence (cr^)^ of positive weights and a linear operation L, we have 



n=1 



M N 2 /M 2 ^ M 



n=1 



where the x n - integrals run from — oo to oo. So the square averaged over the sequence (Un)^ 
r 2 ) M 

'n.Jn=l 



and weighted with (cr^J^L-, is equal to the square averaged over the class of functions that can be 



written as linear combination of (i^jji, measured with Gaussian weights with widths (o" n )^i- 



In the formulation of the statistic in terms of the two-point function this means that (u n , cr^)^ 
gives its spectral decomposition: 

M 

c[(x> h u) 2 ) = y g^,tln(a)l)lin(a?2) • (9) 
n=1 

The sequence (Unjji, usually consists of an orthonormal basis, and several examples of decom- 
positions (u n , o"n)!^=i can be found in [11], including cases with M. = oo. One can also find 
the famous x 2 -statistic interpreted in this way there, with (unjji, a set of indicator functions of 
non-overlapping subsets of CI, and o^ = 1 /(u n ). 

A closer look at formula (5) for the GOF statistic reveals that it is highly impractical for the 
estimation of the TD with the Monte Carlo method, firstly because it is quadratic in the number 
of data-points and secondly because a rfzm-dimensional integral has to be calculated for each 
data sample. 3 One such integral evaluation can be performed within acceptable time-scale using 
Monte Carlo integration techniques, by generating integration-points w distributed following P 
and calculating the average of Y-iLi Q c { w h w). In order to make an estimate of the TD with a 
histogram, however, one would have to calculate in the order of a thousand of such integrals. 

Fortunately, the precise definition of the statistic, or more explicitly the spectral decomposi- 
tion of the two-point function, can be chosen such that the asymptotic TD becomes Gaussian 

3 The 2<f;'ra-dimensional integral does not depend on the data sample, and has to be calculated only once. 
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for dim > oo, as we will see in the next section. This indicates that, for large dim, only 
depends on the expectation value and the variance of the statistic. In section 5, we will see an 
explicit example for which P influences as good as only through the expectation value and 
the variance for any dim > 1 , even before P M looks like a Gaussian. So instead of thousands of 
Jz'm-dimensional integrals for a histogram, one only has to calculate a dim, two Idim, a 3 dim and 
a 4Jz'm-dimensional integral for the expectation value (7) and the variance (8). 



4 Calculation of the asymptotic test-distribution 

We approach the calculation of Vn through its moment generating function 

G N (z)=E(e zTN ) . 
"Pm can be recovered form Gn by the inverse Laplace transformation 



P N (t) 



dz 



27ti 



7 exp(S(t;z)) , S(t;z) =logG N (z) - tz , 



(10) 



where V runs from — ioo to ioo on the left side of any singularity of Gn. The analysis of Gn can 
be simplified by the observation that the statistic (4) does not change if we replace 

Un^Un-^UuQP) 

in the spectral decomposition, since L(u n ) = (u n Q)N — (u n QP) is invariant (remember that 
(P) = 1 ). In other words, (4) with \x Gaussian and two-point function (9) is equivalent to 



in 



N 



N 



n 



(fQ) N ix(df) = - Y_ Q(oH)QK)c(aH, a>,) , 



(11) 



t,i=i 



with \x Gaussian and two-point function 



M , 

c{a> h a) 2 ) = Y- ff n( U n 



wi)-— r ) ( u n (cu 2 ) 



Q(o>i) 



Q(a> 2 ) ; 



(12) 



With this decomposition, we can put (f QP) equal to zero under the measure. We continue in the 
spirit of [9, 11], and write 



C1J 



c(o),ti)6 n (co)6 n (ti) dcudr) , 



a 



where 
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so that, using Gaussian integration rules, we find that 



,zT N 



,v / 2i(f6 N > 



H(df) 



n 



n 



N 



]~[ e V2*/Nf(«H)Q(«H) U( df) ; 



i=1 



and 



G N (z)=E(e zTN ) = 



(PeV^ f Q) N ^(df) 



We shall restrict ourselves to the asymptotic distribution for N — > oo from now on. We find 



Goo(z) = lim G N (z) 



N— >oo 



e z < f2 Q 2p V(df) , 



n 



where we used the fact that that (f QP) can be taken equal to zero under the measure. Substituting 

f {a) )=f_ Xn (u n [iv)-!^) and ^(df) = ne-^-^== 

and applying well known Gaussian integration rules, we find 

GoJz) =det(l -2zA)- 1/2 , (13) 

with 

A n , m = o- n o m (u n u m Q 2 P) - cr n (u n QP)ff m (u m QP) . 

The asymptotic generating function is now determined up to the positions of its singularities, 
which can directly be written in terms of the eigenvalues (A n )^i of A, since 



GooU)= ( nn-2zA n ) ] 



-1/2 



(14) 



Another way to see how the eigenvalues affect the shape of the TD is by considering the cumu- 
lants, which are generated by the logarithm of the generating function: 



d k logG c 
dz k 



M 



-(z = 0) = 2 k ~ 1 (k- 1)!_^A k 



n=1 



If A would consist only of a diagonal term plus a diadic term, then the access to its eigenvalues 
would be relatively easy. Having in mind that the functions u n are orthonormal, this can be 
achieved by the choice 

Q = l/v / P , 



so that 



A n ,m = ffu 5 u,m- Cn(ttnV / P>a m (u m V / P) 



(15) 



8 



4.1 Gaussian limits 



Without loss of generality, we may assume that the weights cr n are ordered from large to small. 
Then, it is not difficult to see [11] that the eigenvalues (A rL )J^ 1 of the matrix (15) satisfy 

o"i > Ai > o 2 > A 2 > 03 > A3 > • • • > Cm-i > A M _! > o M > A M . (16) 

It is important to realize that (16) holds whatever P is. The influence P may have on the shape of 
Too is restricted to the freedom each of the eigenvalues A n has to change value between cr n and 
cr n+1 . The smallest eigenvalue is non-negative since the matrix A is positive: for any vector x 
we have 

M M , M \ 2 

y~ A n>m x n x m = ^~ aixi - I 5~ o T1 (u Tn v / P)x T 



n,m=1 



> 



M 




, M 


L 






n=1 






M 




, M 


L 

n=1 


«-[ 





M 



where the first inequality is by Schwarz, and the second one is based on the assumption that 
(u rL )^ =1 is an orthonormal (but not necessarily complete) set and (P) = 1 . 

For the case that P = 1 , it has been shown in [8] that P ro becomes Gaussian if and only if 
there is a limit for the statistic such that 

— i^p >0 . (17) 

V A 2 

Typically, this limit may be dim — > 00, as is shown in various examples. For simplicity, we 
assume that <j-\ = a 2 , which is actually the case in most examples in [8]. Using this and (16), 
it is easy to see that, if (17) holds, then also A 2 /{Y.^ o£) -» and Af/(^J^ 2 o£) -> 0, and 
that the limit holds for any P. So we may conclude that whenever the spectral decomposition is 
chosen such that o"i = a 2 and there is a limit such that 

* 0. 



Y M a 2 

2_n=1 °r 

then Voo becomes Gaussian in this limit. 



5 Example 

The following example of a GOF statistic in many dimensions is based on the diaphony [10, 11], 
and has the following spectral decomposition. The basis is the Fourier basis in dim dimensions: 

dim 

Un(cu) = ]^[u nk (X k (cu) ) , n k = 0,1,2,... , k = 1 ,2, . . . , dim , 

k=1 
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with 

u (x) = 1 , u 2n -i M = v / 2sin(27tnx) , u 2n M = v / 2cos(27tnx) , 
for n from 1 to oo. The corresponding weights are given by 

dim 1 

Oft = Yl <V With °"0 = 1 , 0"2n-l = 0"2n = - • 

k=1 n 

The two-point function is equal to 

dim 

c(cui,a> 2 ) = y g|tifi(a?i)un(a>2) = JJci(Xic(a>i) -X k (cu 2 ) ) , 

ft k=1 
i ^ — ^ — co v — oo v — OO i 

where = L ni =o Ln 2 =o ' ' ' 2^=0 and 

7t 2 

Cl (x) = 1 + y -2ti 2 (x mod 1)(1-x modi). 

The only important difference with the two-point function of the diaphony is that there the con- 
stant mode, the J/m-dimensional basis function which is equal to 1 , is missing. This makes sense 
since the diaphony is constructed in order to test the uniform distribution and the contribution 
of the constant mode cancels in (2). The advantage is that the diaphony is directly given by the 
sum of all two-point correlations between the data-points and no integrals of two-point functions 
have to be calculated. Notice that this cancellation also appears in (15): the first row and column 
of the matrix A consist of only zeros if P = 1 , since all modes except the constant mode have 
zero integral. For a general PDF these cancellations also exist, but not for a single mode, and 
hence are not of practical use. For example, f = 1 / Q cancels in (4). 

It is useful to introduce the function p which counts the number of weights with the same 
value: 

P(s) = Y 6 S|1/fffl . 
ft 

The numbers p(s) increase as function of dim. Using p, (14) and (16), the generating function 
can be written as 

-1/2 



GooW = ( f[(1 -2z/s 2 ) p(s) - 1 (1 -2zA s ) ) 

v s=1 ' 



where the numbers A s depend on the PDF under consideration, but are, following (16), restricted 
by the relation 

1/s 2 >A s > 1/(s + l) 2 . 

In order to find the probability density V^, the inverse Laplace transformation (10) has to be 
performed on Goo. The logarithm of the product can best be evaluated as described in [12], by 
extracting the first and the second order terms in z: 

oo 

logG oo (z)=Ez+IVz 2 + ^(g s (z)- g ;(0)z-ig;'(0)z 2 ) , (18) 

s=1 
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2 4 6 8 10 12 14 

Figure 1: (t) for dim = 1 and P = 1 using (19) and (18) with one term. 



( _ pM 1 log(l — 2z/s 2 ) — ^ log(1 — 2zA s ), and E and V are the expectation value 
and the variance of the statistic. For the case that P = 1 , so that A s = 1 /(s + 1 ) 2 , they can be 



where g s (z) 



calculated directly and are given by 



7t 

^■uniform ( ' ^ 



dim 



I TV 

■ . V un if orm = 2 ( 1 + — 



dim 



We want to study the influence of P on by generating the eigenvalues A s at random, uniformly 
distributed within their borders, and plotting the result. First, however, we need to find out 
how many terms in the infinite sum of (18) have to be taken into account in order to obtain 
a trustworthy result. This can be done at dim = 1 , since we know already that Voo will tend 
to look like a Gaussian for larger values of dim so that the sum must become less important. 
Furthermore, there is the advantage that at dim = 1 and P = 1 there exists a simple formula for 
the generating function: 

V / 2~7t 2 Z 



'uniform 



(z) 



sin V / 2~7t 2 z 



(19) 



In Figure 1, we present the result with this formula and with (18) using only one term. With 
10 terms, the difference between the curves is invisible and this is the number we further use. 
Results for dim = 2 are depicted in Figure 2. As expected, the curves look more 'Gaussian' than 
the 1 -dimensional curve. The crosses represent the case P = 1 , and the two continuous curves 
represent two cases with 'typpical' sets of random eigenvalues. The curves are clearly different, 
but if we go over to standardized variables, that is, if we plot 



VVTUv/Vt + E) 
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Figure 2: P M (t) for dim — 2, for P = 1 and two sets of random eigenvalues A s . 



so that the expectation value is equal to and the variance is equal to 1 , we find Figure 3, and 
we may conclude that the curves almost only depend on the expectation value and the variance. 
Again, we know that this behavior becomes only stronger for higher values of dim because of the 
Gaussian limit. We conclude that for general P can, to satisfying accuracy, be approximated 

by 

A/Vuniform/V "P™ lform ^ y/ V uni f orm / V (t — E) + E uniform j , 

where "P™ lform , E uniform and V uni f orm are the asymptotic test-distribution, the expectation value and 
the variance for the case that P = 1 . 



6 Conclusion 

We have seen how to construct practical GOF statistics to test the hypothesis that a sample of 
data is distributed following a given PDF in many dimensions, for which the asymptotic test- 
distribution in the limit of an infinite sample size becomes Gaussian in the limit of an infinite 
number of dimensions. Furthermore, we have seen an explicit example of such a statistic, for 
which the asymptotic test-distribution depends on the PDF as good as only through the expecta- 
tion value and the variance of the statistic for any number of dimensions larger than one. 
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